multiple choice examinations
Abstract Cheating in examinations is acknowledged by an increasing number of
organizations to be widespread. We examine two different approaches to assess their
effectiveness at detecting anomalous results, suggestive of collusion, using data taken
from a number of multiple-choice examinations organized by the UK Radio Communication
Foundation. Analysis of student pair overlaps of correct answers is shown to give
results consistent with more orthodox statistical correlations, for which confidence
limits, as opposed to the less familiar “Bonferroni method”, can be used. A simulation
approach is also developed which confirms the interpretation of the empirical approach.
Background

Cheating in examinations is now acknowledged by an increasing number of academic
institutions to be widespread. Wesolowsky (2000) notes one 2005 US study
(McCabe 2005) which reported that, of 10,000 faculty, 44% were aware of student
cheating in the previous 3 years yet never reported the fact. He also points to a
second study, by the Josephson Institute of Ethics, also in the USA. This surveyed 43,000
high school students in October 2010 and found that more than one half admitted
cheating on a test during the past 12 months and one third admitted cheating more than
twice. Cizek (1999), as a result of a survey of studies prior to 1999, concluded that
“cheating is rampant”. Most research since suggests cheating is more widespread
than is usually believed. In the UK cheating has been reported to occur during medical
examinations (McManus 2005), but other specialisms are not exempt. Nor is
it confined to schools and universities. Many professionals, aspiring car drivers,
nuclear missile control supervisors and amateur radio operators are all required to take
examinations in order to gain an appropriate license. It is also not unknown for the
teacher to collude with examinees. This was illustrated by a recent article in Time
Magazine (Docktorman 2014) which reported the indictment of a college Principal
and four teachers for conspiracy and provision of examination answers to students
before the examination. Multiple-choice examinations form a significant element of
the testing process in all these cases. Here we report the development and
implementation of a statistical method aimed at detecting cheating in multiple-choice amateur
radio examinations. Our empirical study is complemented by a set of simulations,
which offer additional insight into the method and outcomes.

The next section outlines the way amateur radio examinations in the UK are administered
and managed, together with initial approaches to cheating detection. A statistical
approach to detection is then described, together with some output for three
different examinations that illustrate the potential of the method. Finally we end with
thoughts and conclusions.

Amateur radio has its origin in the early studies by Marconi and other physicists of
electromagnetic wave propagation at the end of the 19th and beginning of the 20th
centuries. As a hobby it began to grow substantially after World War One. It has been
estimated that around two million people worldwide are regularly involved with
amateur radio. In the UK for the past few years new entrants have numbered about 1,000
annually. To participate, entrants across the world must obtain a license. Most
advanced countries issue this following success in a national examination. In the UK,
the government office of communications regulation (OFCOM) delegates responsibility
for the examination to the Radio Communications Foundation (RCF), an independent
charity established to support people and projects where radio communication,
and amateur radio in particular, is the theme. The examination itself provides for
three levels of license: Foundation, Intermediate and Advanced. Here we are
concerned with the Advanced examination, which at the moment consists of 62 multiple-choice
questions that test knowledge of various aspects of radio including the license
conditions, basic electronics, radio and transmitter architecture, antennae and
electromagnetic wave propagation, testing methods and safety. Each question is
constructed with the correct answer and three distractors. The examination is held
simultaneously at a number of approved centres located across the UK. The number
of candidates at any centre varies depending on demand (see schematic shown in
figure 1). As an example, in 2008 63 candidates sat for an Advanced examination
across 25 centres. The average number of candidates per centre varies from 1 to 6.


Fig. 1: Schematic illustration of the organization of examinations across the UK (the
candidate network), with relatively small numbers of students taking the examination at
different centres simultaneously. Source:


Locations are vetted and the conduct of the examination follows the traditional pattern
found in most schools and academic establishments. Invigilators are required to
be named and deemed suitable by the RCF prior to the examination. However, unlike
in schools, the invigilators are unpaid volunteers. Optical mark sheets record
the answers, so for each candidate one has a 62-character string, viz.
ABDC... BCADC, plus two additional identifiers for the candidate and the particular
centre in which the candidate sat the examination.

It can be noted that for confidentiality reasons all identification code numbers of 
candidates and examination centres have been removed from the results given in the 
paper. 
Analysis of empirical data


Table 1 Useful definitions for the analysis of multiple choice examinations

Quantity                                                   Value
---------------------------------------------------------  -------------------
Correct answers                                            4 1 4 4 3 3 3 4 1 4
Answers of candidate 1 ($R_1$)                             4 3 4 4 4 1 1 4 1 2
Good (1)/wrong (0) answers of candidate 1 ($S_1$)          1 0 1 1 0 0 0 1 1 0
Score of candidate 1 ($A_1$)                               5
Answers of candidate 2 ($R_2$)                             4 1 4 4 3 4 3 4 1 4
Good (1)/wrong (0) answers of candidate 2 ($S_2$)          1 1 1 1 1 0 1 1 1 1
Score of candidate 2 ($A_2$)                               9
Overlap of correct answers ($A_{12}$)                      5
Overlap of all answers ($T_{12}$)                          6
$(A_1 + A_2)/2$ (arithmetic mean score of the pair 1,2)    5.50
$\sqrt{A_1 A_2}$ (geometric mean score of the pair 1,2)    5.48
Correlation between $R_1$, $R_2$                           0.35
Correlation between $S_1$, $S_2$                           0.33


Notes: For each question, the candidates had to select one of the 4 proposed answers. In the examination there
were 62 questions but this table is limited to the first 10 answers. Candidates 1 and 2 were the first two
candidates in the data set of 2008. They took the examination in the same centre. Altogether for the 19 centres there
were 110 candidates.

Initially, in the file of the UK Radio Communication Foundation, the answers were coded in the form
A,B,C,D. Thus, the first step was to replace A,B,C,D by a numerical coding such as 1,2,3,4, which
can be done either with a text editor or with the Unix transliteration “tr” command. It can be seen that the
arithmetic mean and the geometric mean (used by McManus 2005) are very much the same. In the following sections
their common value will be referred to as the average score of a pair of candidates.

Source: Radio Communications Foundation of the United Kingdom. 
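
The per-pair quantities defined in Table 1 can be computed directly from the answer strings. The following is a minimal sketch in Python, not the authors' code: the helper names `pearson` and `to_digits` are hypothetical, and the input is the ten-answer excerpt shown in the table (already recoded from A,B,C,D to 1,2,3,4).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def to_digits(s):
    """Turn a string such as '4144' into [4, 1, 4, 4]."""
    return [int(ch) for ch in s]

# First ten answers from Table 1.
key = to_digits("4144333414")   # correct answers
r1 = to_digits("4344411412")    # answers of candidate 1
r2 = to_digits("4144343414")    # answers of candidate 2

# Right (1) / wrong (0) vectors and scores.
s1 = [int(a == k) for a, k in zip(r1, key)]
s2 = [int(a == k) for a, k in zip(r2, key)]
score1, score2 = sum(s1), sum(s2)

# Overlap of correct answers: questions both candidates got right.
overlap_correct = sum(a * b for a, b in zip(s1, s2))

# Correlation of the raw answers and of the right/wrong vectors.
corr_answers = pearson(r1, r2)
corr_right_wrong = pearson(s1, s2)

print(score1, score2, overlap_correct)
print(round(corr_answers, 2), round(corr_right_wrong, 2))
```

For the full examination the same functions would simply be applied to the 62-character strings of every pair of candidates.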


In studying the data we follow a method similar to that proposed by McManus et
al. (2005), who examined the overlap $A_{ij}$ of correct answers for each and every pair
of candidates in their cohort as a function of the geometric mean of the scores $A_i$
and $A_j$ for the pair of students. In our case it is useful to consider not only the total
network of all student pairs but also the sub-network formed by pairs of students
who sat the examination in the same centre. A feature of a scatter plot such
as that shown in figure 2 is that each data point falls entirely inside the limited area
defined by the dotted triangle. This may be understood when one realizes that the
overlap cannot exceed the lower of the two scores, so that for a given geometric mean
its maximum value occurs when the scores for each student are equal; the
minimum value of the overlap depends on the values scored. If the students score less
than 50%, the minimum possible value of the overlap is zero. For students scoring
over 50% the minimum overlap increases linearly from zero to the maximum value
when each student scores 100%.
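
The triangle can be written as two inequalities on the overlap of correct answers: it is at most the lower of the two scores, and at least the number of questions that both candidates must share once their correct answers no longer fit side by side among the q questions. A small sketch (the function name `overlap_bounds` is ours, introduced only for illustration):

```python
def overlap_bounds(score_i, score_j, n_questions=62):
    """Smallest and largest possible overlap of correct answers.

    lower: correct answers that must coincide when the two scores
           together exceed the number of questions
    upper: the overlap cannot exceed the lower of the two scores
    """
    lower = max(0, score_i + score_j - n_questions)
    upper = min(score_i, score_j)
    return lower, upper

print(overlap_bounds(20, 25))   # both below 50%: the overlap may be zero
print(overlap_bounds(40, 50))   # both above 50%: some overlap is forced
print(overlap_bounds(62, 62))   # both perfect: the overlap is fixed
```

For equal scores A the bounds become max(0, 2A - q) and A, which reproduces the linear rise of the minimum overlap above the 50% mark.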
Assuming no contact by phone, internet or other wireless means during the examination,
we may expect the overlap scores for the total student network to be
completely uncorrelated, since it is dominated by pairs from different examination
centres. This is shown clearly using a dataset from an examination taken in 2012 and
sat by a total of 62 students equating to 1953 student pairs.

Figure 2 shows the outcome for the overlap $A_{ij}$ plotted as a function of the square
root of the product of the individual scores $A_i$ and $A_j$. The data points for the total
network fall about a mean, $m = m(\sqrt{A_i A_j})$, that increases as the geometric mean
itself increases. The solid line is a one-parameter power law fit of this “mean” function.
We deduce $m(\sqrt{A_i A_j}) = 62\,[\sqrt{A_i A_j}/62]^{a}$, where the exponent $a$ is chosen such
that the sum of the overlap fluctuations about the mean line is zero. In this instance
$a = 1.75$. The distribution of these overlap fluctuations about the mean line fits well
a Gaussian distribution. Figure 3 shows the cumulative distribution highlighting the
final few points in the positive tail. No unusual data outside the Gaussian distribution
is apparent.
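
The zero-sum condition on the fluctuations fixes the exponent uniquely, and it can be found by simple bisection. The sketch below is illustrative only. It assumes the mean line has the normalized form m(g) = q (g/q)^a, with g the geometric mean score and q = 62. It also uses synthetic, independent candidates whose abilities are drawn uniformly between 0.5 and 0.95, which is our assumption and not the examination data. All function names are hypothetical.

```python
import random
from math import sqrt

Q = 62  # number of questions, as in the Advanced examination

def mean_line(g, a, q=Q):
    """Assumed one-parameter power law: m(g) = q * (g / q) ** a."""
    return q * (g / q) ** a

def fit_exponent(points, q=Q, lo=0.1, hi=10.0, tol=1e-9):
    """Bisection on a so that the residuals overlap - m(g) sum to zero.

    The summed residual increases with a (because g / q <= 1), so a
    sign change bracketed by [lo, hi] is located by bisection.
    """
    def total_residual(a):
        return sum(ov - mean_line(g, a, q) for g, ov in points)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def synthetic_points(n=62, q=Q, seed=1):
    """(geometric mean score, overlap) for all pairs of independent candidates."""
    rng = random.Random(seed)
    sheets = []
    for _ in range(n):
        p = rng.uniform(0.5, 0.95)  # illustrative ability of this candidate
        sheets.append([int(rng.random() < p) for _ in range(q)])
    scores = [sum(s) for s in sheets]
    points = []
    for i in range(n):
        for j in range(i + 1, n):
            overlap = sum(x * y for x, y in zip(sheets[i], sheets[j]))
            points.append((sqrt(scores[i] * scores[j]), overlap))
    return points

points = synthetic_points()
a_hat = fit_exponent(points)
residual_sum = sum(ov - mean_line(g, a_hat) for g, ov in points)
print(a_hat, residual_sum)
```

For fully independent answers the expected overlap is q times the product of the two success probabilities, so the synthetic exponent should come out close to 2; an empirical value below 2 would then reflect structure that this toy data does not contain, such as questions of unequal difficulty.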

We may now go further and examine the smaller sub-network of student overlaps
for only the students within centres. The small circles in Figure 2 display the results
for this sub-network. We observe no unusual clustering among these data. The
points appear to be distributed throughout the entire dataset. This suggests that the
argument often put forward that students who work together will score similarly is
not well founded.

However, figures 4a and 4b show two further datasets, from 2008 and 2011 respectively.

For these examples we see a number of outliers, which become very apparent on the
corresponding cumulative distributions.

A similar situation prevails for the 2011 examination results. However, now we see
two extreme outliers, each representing a pair of students from the same centre. This
approach is similar to that described by McManus (2005), who used it for medical
examinations. We now turn to a simulation of the process, which as
we shall see gives us much greater insight into the method.

Simulation model 

It is possible to simulate situations which mimic the examinations. Such simulations
have two main purposes.

• They allow us to study the influence of the major parameters which define an
examination, namely: the number of candidates, the number of questions, the number of
possible answers and the average level of the candidates.

• By giving the opportunity to run many iterations, the simulation allows us to
judge the effectiveness of different methods for the identification of cheating.

Fig. 2: The overlap $A_{ij}$ as a function of the square root of the product of individual scores $A_i$ and $A_j$. Student
pairs from the same centre are shown as big green dots whereas student pairs from different centres are shown
as small red dots. An electronic version of the paper with full color figures is available on
the arXiv database of preprints at the following address: http://xxx.lanl.gov/find/cond-mat. Source: Dataset
from 2012.

Fig. 3: Cumulative distribution for “overlap” fluctuations about the mean line (see text for definition) shown
in figure 2. No untoward data points or outliers are evident from this dataset. The total amplitude on the log
scale of the vertical axis of the inset is from -0.001 to +0.002. Source: Dataset from 2012.

Principle of the simulation 
Fig. 4a,b: (a) Data for an examination held in 2008. (b) Similar data for an examination held in 2011. Both
exhibit outliers that correspond to student pairs from the same centre. Source: Dataset from 2008 and 2011.

Fig. 5a,b: The cumulative distribution for the 2008 examination (a). Detail from the tail of the distribution is
shown in (b), from which it becomes apparent that the outlier is actually three points corresponding to three
students from one particular centre who have almost identical scores and identical overlaps.

Fig. 6: The cumulative distribution for the 2011 examination showing two adjacent outliers from the same centres.
A small number of large random fluctuations for candidates in different centres are already visible in Fig. 2 and
4a,b. This is not surprising given the large number of pairs: ~ 100 x 50 = 5,000.
Beyond the technical details (which are given in the Appendix), it is important to
understand the nature of the probability problem at the core of the examination
process.

In the examinations considered in this paper there are a = 4 answers proposed for
each question. Let us for a moment suppose that there are only 2. Then the whole
examination process involving n candidates and q questions is equivalent to throwing
n coins q times. The face-up sides in throw number k give the answers to
question number k. In other words, the examination is a binomial random trial.
Usually, in binomial trials the coins are supposed to be independent of one another.
If, on the contrary, we suppose that two coins are in some way (electrically,
magnetically or in any other way) in interaction, their results will become dependent
in the same way as the answers of two candidates who communicate with one
another.

If we assume that a = 6, the examination is equivalent to rolling n dice q times. As
before, the upper sides of the n dice in trial number k give the answers
of the n candidates to question number k. In other words, the examination is
a multinomial random trial. When two candidates communicate, it means that two
dice are in interaction.

In the real examination a = 4 but if we are only interested in whether the answers 
are right or wrong, the examination becomes again equivalent to a binomial trial. 

In short, the detection of cheating refers to a fundamental problem of probability 
theory, namely how to determine whether or not two random variables are dependent. 
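
The coin picture translates into a few lines of code. The sketch below is a toy version written for illustration, not the procedure of the Appendix. Each candidate answers each question correctly with probability p, and candidate 1 copies the right/wrong result of candidate 0 on each question with probability `copy_prob`, answering independently otherwise. This copying mechanism is our assumption; under it the expected correlation between the two right/wrong vectors equals `copy_prob`. All names are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def simulate_exam(n=62, q=62, p=0.5, copy_prob=0.99, seed=0):
    """Right (1) / wrong (0) sheets for n candidates and q questions."""
    rng = random.Random(seed)
    sheets = [[int(rng.random() < p) for _ in range(q)] for _ in range(n)]
    # Candidate 1 copies candidate 0 question by question (assumed mechanism).
    sheets[1] = [s if rng.random() < copy_prob else int(rng.random() < p)
                 for s in sheets[0]]
    return sheets

def most_similar_pair(sheets):
    """Return (correlation, i, j) for the most strongly correlated pair."""
    best = None
    for i in range(len(sheets)):
        for j in range(i + 1, len(sheets)):
            c = pearson(sheets[i], sheets[j])
            if best is None or c > best[0]:
                best = (c, i, j)
    return best

sheets = simulate_exam()
corr, i, j = most_similar_pair(sheets)
overlap = sum(a * b for a, b in zip(sheets[i], sheets[j]))
print(i, j, round(corr, 2), overlap)
```

Changing `n`, `q`, `p` and `copy_prob` gives a quick way to explore how the number of candidates, the number of questions, the level of the candidates and the strength of the dependence affect how far the colluding pair stands out.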

In figure 7 we show the results of 62 students completing 62 questions. In both
panels it is assumed that the probability of choosing the correct answer is constant
across all questions and all students. We do not expect this to be a realistic assumption,
but we shall see that it serves as a toy model that allows us to understand how
variations in the various aspects of the matter affect the outcomes, something which is not
possible using only the available empirical data. Figure 7a assumes a fairly smart set
of students who answer each question correctly with a probability of p = 0.8, but the
student pair, 1 and 2, cheat with an overlap correlation of 0.99. Figure 7b is similar
but the students are assumed to be less smart and the probability of answering any
question correctly is only p = 0.5. The cheating pair is immediately spotted as an
outlier. The distribution of the bulk of the data points is more limited than that
for the actual student cohort, but clearly a real class will have students with different
abilities, so we can expect a distribution of values for p to be more appropriate. What
is evident from the two figures is the deviation of the data from the right hand line
as the value of p is reduced, in line with actual data. However, further calculations to
explore this issue will be left for another study.




Fig. 7: Simulation for a group of 62 candidates. There are also 62 questions. Each dot corresponds to a pair
of candidates. One pair, namely (1,2), is supposed to cheat with probability 0.99 in the sense that the correlation
between their sets of results is 0.99. Figure 7a is for candidates who answer each question correctly with a
probability 0.8. Figure 7b is the same simulation but the probability of answering questions correctly is only
0.5. $A_{ij}$ is the overlap of correct answers. Horizontal axis of both panels: average score of pair.

Figure 8 shows the degree to which correlation between the cheating pair influences 
the position of the outlier. If the correlation parameter is less than 0.5, it would make 
no sense to assign it as an outlier. But what value ought to be chosen to eliminate 
false positives? We return to this point in the next section after looking at one further 
outcome from the simulations. 

Figure 9 shows the effect on the results of changing the number of students relative
to the number of questions in the examination paper. We see that an outlier is more
obvious when the number of students is less than the number of questions. As an
aside we note here that the total number of student pairs remains constant across the
graphs; that this may not appear so reflects the fact that many points are superimposed
when integer values of the overlap coincide.

From figure 10, we see that an outlier is also more obvious for students who are not 
so smart than when they all score highly. 

Our examination data is usually for a cohort which is greater in number than 62, the 
number of examination questions. But they are not all equally smart and moreover 
our network, as we noted at the outset is fragmented, with many centres hosting only 
a few examinees. Overlaying the sub-network on the complete network of overlaps 
as we showed earlier now helps overcome any potential masking of the outliers. 
Examples are shown in Figure 4a,b. 

The majority of data points for the sub-network are more confined, forming a slightly
tighter distribution within the total network and leaving the single outlier in this example
more clearly exposed. As for the 2012 data, the majority of points within the sub-network
appear to be more or less randomly scattered about the mean relative to the
totality of points, so again we cannot conclude that students who study together
gain any additional advantage over those who do not, and once more we see that the
frequent assertion that students who study together will have common overlaps does
not pass muster.

Fig. 8: The graphs show the emergence of the outlier as the correlation between the answers of candidates 1
and 2 increases from zero to one.

Fig. 9: For the 3 graphs there are 50 candidates but from left to right the number of questions increases from
20 to 50 and 200. As expected the outlier is more obvious when the number of questions becomes larger.
Fig. 10: These graphs combine two effects: change in the smartness of the students through the probability p
and change in the number of questions. The two effects work in opposite directions. The bottom right graph
shows that when p = 0.9, even with a number of questions as high as 200 (which from a practical perspective
seems to be an upper limit), the outlier is hardly detectable. Horizontal axis of each panel: average score of pair.


Wesolowsky stresses the importance of using a conservative method to avoid
false positives. He kindly agreed to process our data using his proprietary method,
and the specific cheaters he identified are the same as those obtained here. He imposes a
boundary value of $10^{-4}$ on the probabilities; this is shown as a light blue line in the
cumulative distributions in figures 5b and 6. Points lying below this value are the outliers.
We may understand this by plotting our data on a log-log scale and superimposing
a Gaussian distribution chosen to fit the points in the main bulk of the distribution.
Figure 11 illustrates the detail. The green line is the cumulative distribution function
for our data; the red line is a Gaussian fit to the bulk of the data. In this case the
outlier has a probability 8 orders of magnitude greater than that which we would expect
from the Gaussian distribution function. Simple enough to identify, one would think.
But at this point it is usual to apply a statistical test ascribed to Bonferroni in order to
reduce the chance of false positive results¹. It is argued that the distribution function
computed from the data represents the outcome of a single experiment. However, we
have N candidates sitting the examination. The Bonferroni correction says that the
true probabilities ought to be N(N - 1)/2 times greater than the single Gaussian values.
It is not easy to understand why this should be so. Moreover, in our case the application
of this is a little ambiguous since candidates take the examination in small
groups. Do we choose a value of the order of the group size, typically 3 to 6, or do
we choose N = 63, the total size of the cohort? We shall ignore this and select the
most conservative approach, which is to choose N = 63. Our correction factor is
then of the order of $10^{4}$, which means the corrected probability is still far less than
that observed in this case, so we can be quite sure that our outliers are true outliers
and not just erroneous points associated with the Gaussian.

1 For more details see the Wikipedia article entitled “Bonferroni correction”; in particular see the references to the
statistical papers cited in this article.
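
One common way to motivate the factor N(N - 1)/2 is the union bound: if each of m pairs independently has probability p of looking extreme by chance, the probability that at least one of them does so is at most m times p. The sketch below shows only the arithmetic and is not Wesolowsky's proprietary method. The function names and the default threshold of 1e-4 are our illustrative assumptions.

```python
def n_pairs(n_candidates):
    """Number of distinct candidate pairs, N(N - 1)/2."""
    return n_candidates * (n_candidates - 1) // 2

def bonferroni_adjust(p_single, n_candidates):
    """Single-pair probability scaled by the number of pairs, capped at 1."""
    return min(1.0, p_single * n_pairs(n_candidates))

def is_outlier(p_single, n_candidates, threshold=1e-4):
    """True if the adjusted probability still lies below the threshold."""
    return bonferroni_adjust(p_single, n_candidates) < threshold

print(n_pairs(63))                   # whole cohort
print(n_pairs(6))                    # one centre with 6 candidates
print(bonferroni_adjust(1e-8, 63))
print(is_outlier(1e-8, 63), is_outlier(1e-6, 63))
```

Comparing `n_pairs(63)` with `n_pairs(6)` makes the ambiguity discussed above concrete: the size of the correction depends strongly on whether the whole cohort or a single centre is taken as the reference group.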


Fig. 11: The cumulative distribution of overlaps for the data of 2008 plotted on a log-log scale (horizontal axis:
residual). Now the deviation of the actual distribution (circles) from a Gaussian (points) is clear. The three data
points from the same examination centre and suggestive of collusion are visible and well separated from the
Gaussian curve; the plot annotation marks the outlier as displaced by 8 orders of magnitude.
The Bonferroni correction offers only 4 orders of magnitude of hope, leaving a further 4 orders of magnitude
unaccounted for!
Analysis of the second data set (figure 6) suggests both outliers are true examples of
anomalous pairs, each being for students from the same examination centre, although
the penultimate point only just falls within this conservative cut-off!

Overlaps or correlations? 

The method of looking at overlap of correct answers (as used by McManus et al.
2005) does not use information relating to wrong answers. Moreover, and this point
is probably more meaningful, it does not take into account the specific answer numbers
selected by the candidates. In this section we propose an alternative method based on
the correlations between the two results vectors of pairs of candidates. In the present
case the results vector of a candidate is a set of 62 numbers between 1 and 4.

Why should one use correlations when in this problem the natural variable is the
overlap between the answers of different candidates? When a teacher begins to
suspect some “abnormalities”, the first reaction is to assess the similarity in the answers
of the candidates. That is why the overlap was used as a key variable in the first part
of this paper.

However, the overlap is not a standard statistical concept and does not easily lead to
an assessment in terms of confidence intervals. That is why, instead of the overlap,
we will here use the correlation. Overlap and correlation are two measures of the
degree of similarity between two sets of results and it is easy to check that the two
variables increase together when the results become more similar. However, by
using correlations we can use standard theorems that give confidence intervals for a
given level of confidence probability.

More specifically, the confidence interval limits that we will use are obtained
following Morice and Chartier (1954). Lying between -1 and 1, the correlations can
certainly not be considered as Gaussian variables. That is most unfortunate because
the confidence intervals of Gaussian variables can be obtained most easily. Therefore
it is natural to apply a change of variable which will transform the interval (-1,1)
into an interval covering the whole line from minus infinity to plus infinity. The
inverse hyperbolic tangent function does this. Then the correlations so transformed
can be fairly well described by a Gaussian variable. Once we have performed the
standard confidence analysis for Gaussian variables we return to the correlations by
applying a hyperbolic tangent function.
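
The change of variable described above can be sketched as follows. We do not have the exact expressions of Morice and Chartier (1954) to hand, so this example assumes the standard Fisher approximation, in which the transformed correlation is roughly Gaussian with standard error 1/sqrt(n - 3) for n questions. The function name `correlation_ci` is hypothetical.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def correlation_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation r over n questions.

    Steps: map r to z = atanh(r), build a Gaussian interval around z with
    standard error 1 / sqrt(n - 3) (assumed Fisher approximation), then map
    both ends back with tanh. Requires -1 < r < 1 and n > 3.
    """
    z = atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

low95, high95 = correlation_ci(0.35, 62)
low99, high99 = correlation_ci(0.35, 62, level=0.99)
print(round(low95, 2), round(high95, 2))
print(round(low99, 2), round(high99, 2))
```

Because of the tanh back-transformation the interval is not symmetric about r, and it always stays inside (-1, 1). A correlation of exactly 1 or -1 cannot be transformed and would need separate handling.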

Before using this method we need to decide the kind of results to which we want to 
apply the methodology. There are basically two possible options. 

The first approach is to use the results in the form “good answer” versus “wrong 
answer”. In this framework, the set of results of each candidate will be a simplified 
vector whose elements are 0 (wrong answer) or 1 (correct answer). For the sake of 
simplicity, the previous simulations have been performed using this framework. 

Alternatively we can use all the available information. As every teacher knows,
common wrong answers are more suggestive of possible cheating than common good
answers. For instance, if the correct answer is A, two D answers may mean either
that the two candidates had the same misunderstanding or that there was some form
of communication between them. In the present case where there are 4 possible
answers the set of results of each candidate will be a vector whose elements can take
the values 1,2,3,4². Because it uses more information it is of course natural to expect
that this second approach will lead to more accurate identification of interdependent
results. This is indeed confirmed by observation as shown in Fig. 12a,b.

2 As can easily be checked, the correlation between two sets of answers remains the same if instead of 1,2,3,4 we use
11,12,13,14 or 1,3,5,7. Whereas the first case is obvious because it is merely a translation, the second is somewhat less
obvious.

Fig. 12a: Analysis of 2008 data (110 candidates) in terms of correlation between the answers to the
62 questions. Horizontal axis: average score of pair (ij) (proportion of good answers). Answers are described
as vectors whose elements are numbers between 1 and 4
depending on which one of the 4 possible answers the candidate selected. The thick (red) staircase curve gives
the averages for groups of 14 successive points. The thin line gives the confidence interval of the averages (at
probability level of 0.95).

It can be observed that the correlations $C_{ij}$ of the answers used in this graph are closely related to their overlap
in terms of right/wrong answers, that is to say the $A_{ij}$ used in earlier graphs; the correlation between the $C_{ij}$
and $A_{ij}$ is r = 0.97.

The outliers whose confidence intervals (also with probability level of 0.95 and represented by the black vertical
lines) do not cross the staircase curve can be considered as having a significantly abnormal correlation.
Altogether there are 8 such pairs: the most conspicuous correspond to the three squares at the top of the
graph almost at correlation 1 (their confidence interval is so small that it does not appear on the graph). They
correspond to the three candidates who took the examination at the same centre and identified already in Fig.
5 by the Gaussian method. The 5 other squares, namely those numbered 2, 3, 16, 164 and 194, correspond to
5 pairs of candidates who took their examination in 4 centres $C_1$, $C_2$, $C_3$, $C_4$. We can say that pair number 2
and pair number 3 took their examination in the same centre $C_1$, whereas the 3 other centres were all different.
The exact code numbers of the centres as well as the code numbers of the candidates have been removed for
confidentiality reasons. Source: Data set from the UK Radio Communication Foundation (2008).

Fig. 12b: Analysis of 2008 data (110 candidates) in terms of correlation between the right/wrong
characteristics of the 62 answers. Horizontal axis: average score of pair (ij) (proportion of good answers).
Answers are described as vectors whose elements are 1 or 0 depending on whether
the answer is right or wrong. The staircase curves have the same meaning as in Fig. 12a. It can be observed
that the correlations $C_{ij}$ of the answers used in this graph are less closely related to their overlaps $A_{ij}$; the
correlation is r = 0.75. This is mainly due to the fact that the $C_{ij}$ are lower and hence more dispersed (due to
larger confidence intervals) than in Fig. 12a.

With the exception of the data point number 2, the outliers are the same as in Fig. 12a. In contrast with Fig.
12a, the correlation does not display an upward trend. This is due to the fact that for low scores the vectors of
the answers contain many 0s in the same way as they have many 1s for high scores; this makes the situation fairly
symmetrical. It can be noted that the high level of the first step of the staircase curve is due to the inclusion in
the average of the 3 top data points with correlations almost equal to 1. Source: Data set from the UK Radio
Communication Foundation (2008).
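
The invariance of the correlation under a change of numerical coding can be checked in a few lines. Both 11,12,13,14 and 1,3,5,7 are obtained from 1,2,3,4 by a shift and a positive rescaling, and the Pearson correlation is unchanged by either. The sketch below uses the ten answers of candidates 1 and 2 from Table 1; the helper `pearson` and the two mapping dictionaries are ours.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# First ten answers of candidates 1 and 2 (Table 1).
r1 = [4, 3, 4, 4, 4, 1, 1, 4, 1, 2]
r2 = [4, 1, 4, 4, 3, 4, 3, 4, 1, 4]

shift = {1: 11, 2: 12, 3: 13, 4: 14}   # translation by 10
stretch = {1: 1, 2: 3, 3: 5, 4: 7}     # x -> 2x - 1

base = pearson(r1, r2)
shifted = pearson([shift[a] for a in r1], [shift[a] for a in r2])
stretched = pearson([stretch[a] for a in r1], [stretch[a] for a in r2])
print(round(base, 4), round(shifted, 4), round(stretched, 4))
```

The invariance holds for such shifts and rescalings only. A recoding that reorders the four labels, for example swapping the numbers assigned to B and D, would in general change the correlation.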

As in previous graphs, green dots correspond to pairs of candidates who took their
examination in the same centre. The black vertical lines shown for a number of outliers
are the lower parts of the confidence intervals for a probability level of 0.95
(which means that if we repeat the same experiment 100 times, about 95 of the correlations
for the same pair of candidates will fall in the confidence interval). For the
3 candidates who correspond to the three squares at the top (almost at correlation
1) the confidence interval is so small that it can hardly be seen.
This is of course an extreme case in the sense that 99% of the answers selected by
these 3 candidates were identical. In such a case cheating is hardly in doubt.

However, for the other cases the situation is less clear. If we wish to interpret them
in terms of cheating or not cheating an additional assumption is necessary, namely
one needs to assume that the bulk of the candidates did not cheat. This assumption has
already been made in the Bonferroni method. In cases where all candidates
cheat, the identification of outliers would only reveal the candidates who cheat more
than the average.

With this assumption in mind, one can adopt the rule that when the confidence interval
does not cross the staircase line that represents the average of all candidates, there
is a significant “abnormal” similarity. Needless to say, such a statement depends on
the confidence level that is selected. With a confidence level of 0.99 (instead of 0.95),
the confidence intervals would be broader and the number of significant cases would
be reduced.
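
The rule can be sketched as follows: sort the pairs by average score, average the correlations in groups of 14 successive points to obtain the staircase, and flag a pair when the lower end of its confidence interval lies above the step it belongs to. This is an illustration, not the authors' code. The function names and the small data set are invented, and the interval again assumes the Fisher approximation with standard error 1/sqrt(n - 3).

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist, mean

def lower_bound(r, n_questions, level=0.95):
    """Lower end of the confidence interval of a correlation (-1 < r < 1)."""
    half = NormalDist().inv_cdf(0.5 + level / 2) / sqrt(n_questions - 3)
    return tanh(atanh(r) - half)

def flag_pairs(pairs, n_questions=62, group=14, level=0.95):
    """pairs: list of (average_score, correlation) tuples.

    Returns the pairs whose interval lies entirely above the staircase,
    i.e. above the mean correlation of their group of `group` points.
    """
    ordered = sorted(pairs, key=lambda t: t[0])
    flagged = []
    for start in range(0, len(ordered), group):
        chunk = ordered[start:start + group]
        step = mean(c for _, c in chunk)
        flagged += [pr for pr in chunk
                    if lower_bound(pr[1], n_questions, level) > step]
    return flagged

# Invented example: 28 pairs with correlation 0.1, two of them raised.
pairs = [(0.50 + 0.01 * k, 0.1) for k in range(28)]
pairs[5] = (pairs[5][0], 0.95)    # very strong similarity
pairs[20] = (pairs[20][0], 0.40)  # borderline similarity

print(len(flag_pairs(pairs, level=0.95)))
print(len(flag_pairs(pairs, level=0.99)))
```

In this invented example the borderline pair is flagged at the 0.95 level but not at 0.99, which illustrates how the number of significant cases shrinks as the confidence level is raised.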


Conclusions 

How can one optimize the detection of cheating? 

The method of overlaps, that compares correct answers for pairs of candidates, is 
shown to be effective at identifying anomalous outcomes suggestive of cheating, in 
multiple-choice examinations. Simulation confirms the method and offers insight 
into parameters that might be controlled in future examinations to optimize such 
identification. The method can be criticized for focusing on overlap only of correct
answers, taking no notice of overlap of incorrect answers. Intuitively it seems
that overlap of incorrect answers could be a stronger pointer to collusion. The use
of correlations allows the use of well-defined and more widely understood confidence
limits for eliminating false positives, rather than the Bonferroni method which
becomes rather ill-defined in our case where the examination cohort is split across 
numerous centres. Since it is conceivable that cases of cheating could be the subject 
of legal action, it is important that well understood methods for elimination of false 
positives are used. We show that the method of correlations appears to be equally 
applicable in our case giving identical outcomes. 

A practical way to minimize the potential for cheating in multiple choice examinations
such as those discussed here is to randomize the order of potential answers on
candidate question sheets and this approach is now used by the RCF in the UK.

Extension of the present analysis to similar problems

This paper is about a specific problem, namely how to detect people who cheat at 
examinations. However, we believe that the identification methods proposed in the 
paper can also be used every time one wants to detect “abnormal” similarities in a 
selection process. 

This can be illustrated by the following example. We consider a group of N customers.
For each of several products (tea, milk, beer, rice, cookies and so on), every customer
is shown 4 brands and is asked which one he or she prefers.

The dispersion of the correlations (or overlaps) of the vectors of selected brands will 
provide a way to measure how similar or closely connected the customers are. The 
analog of cheating will correspond to an abnormally high correlation suggesting a 
strong connection between two customers. 
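
A minimal sketch of this idea follows. It counts, for every pair of customers, the number of products for which both selected the same brand. The customer names, the choice vectors and the helper names `overlap` and `pairwise_overlaps` are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def overlap(a, b):
    """Number of products for which two customers selected the same brand."""
    return sum(x == y for x, y in zip(a, b))

def pairwise_overlaps(choices):
    """Map each pair of customer names to the overlap of their choice vectors."""
    return {
        (u, v): overlap(choices[u], choices[v])
        for u, v in combinations(sorted(choices), 2)
    }

# Hypothetical data: brand index (0 to 3) selected for each of 5 products.
choices = {
    "ann": [0, 1, 2, 3, 0],
    "bob": [0, 1, 2, 3, 1],
    "cat": [3, 2, 1, 0, 2],
}
overlaps = pairwise_overlaps(choices)
most_similar = max(overlaps, key=overlaps.get)
print(overlaps, most_similar)
```

With a realistic number of customers, the pairs of interest would be those whose overlap lies far above the bulk of the distribution of all pairwise overlaps.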

One can also introduce the notion of right/wrong answers. For instance, “right” answers
may be defined as the brands for which there had been a recent advertisement
campaign. Then, one can define the score of a customer as the number of his (or
her) correct answers. The customers with highest scores (the analogue of the smart
students) are likely to be those who watch TV the most.

Appendix: Simulating interdependent random variables

In this appendix we wish to answer two related questions regarding the definition of 
a system $\{X_i\}$ of symmetric binary interdependent random variables:

$$X_i \in \{0,1\}, \quad E(X_i) = p, \quad \sigma^2(X_i) = p(1-p), \quad \mathrm{corr}(X_i, X_j) = \rho_{ij} = r, \quad 1 \le i, j \le n,\ i \ne j \qquad (1)$$

(1) In the simulation we used a system $\{X_i\}$ introduced by Lunn and Davies
(1998). How is it defined?

(2) Apart from the previous one, are there other $\{X_i\}$ systems which share the
properties defined in (1)?

It will be seen that it is only for n = 2 that the correlation matrix uniquely defines
the probability distribution. In other words, the set of random variables introduced
by Lunn and Davies is only one among many possible answers. 

The set $\{X_i\}$ proposed by Lunn and Davies (1998)

It is based on the following result. $(0,1;p)$ denotes a binary variable whose expectation
is $p$.

Proposition The $Y_i$ ($1 \le i \le n$) are a set of independent binary random variables
$(0,1;p)$. $Z$ is also a binary variable $(0,1;p)$. The $U_i$ are a set of binary
variables $(0,1;\sqrt{r})$. The $Y_i$, $Z$, $U_i$ are all independent.

Then the variables $X_i = (1 - U_i) Y_i + U_i Z$ are a system of symmetric dependent
binary variables $(0,1;p)$ whose correlation matrix is $\rho_{ij} = r$.

The proof is easy and given in the paper. Let us add two remarks. 

• We used the expression “symmetric variables” to reflect the fact that all $X_i$ play
the same role. The expression “exchangeable variables” is often used with the same 
meaning. 

• The correlation matrix has only positive elements. This is of course imposed by
the symmetry condition: $\rho_{12} < 0$ and $\rho_{23} < 0$ would tend to imply $\rho_{13} > 0$, thus violating
the symmetry requirement. 
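
The construction in the Proposition can be checked numerically. The sketch below draws the $X_i$ as $(1 - U_i) Y_i + U_i Z$ and estimates the mean and the pairwise correlation from the draws. The function names, the seed and the parameter values ($n = 3$, $p = 0.6$, $r = 0.25$) are arbitrary choices made for illustration, not values taken from the simulations reported in the paper.

```python
import math
import random

def lunn_davies_sample(n, p, r, rng):
    """One draw of (X_1, ..., X_n) with X_i = (1 - U_i) * Y_i + U_i * Z."""
    z = int(rng.random() < p)            # shared variable Z ~ (0,1; p)
    sqrt_r = math.sqrt(r)
    xs = []
    for _ in range(n):
        y = int(rng.random() < p)        # private variable Y_i ~ (0,1; p)
        u = int(rng.random() < sqrt_r)   # switch U_i ~ (0,1; sqrt(r))
        xs.append((1 - u) * y + u * z)
    return xs

def sample_corr(a, b):
    """Empirical Pearson correlation of two equally long sequences."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / m
    var_a = sum((x - mean_a) ** 2 for x in a) / m
    var_b = sum((y - mean_b) ** 2 for y in b) / m
    return cov / math.sqrt(var_a * var_b)

rng = random.Random(0)
draws = [lunn_davies_sample(3, 0.6, 0.25, rng) for _ in range(20000)]
x1 = [d[0] for d in draws]
x2 = [d[1] for d in draws]
# Both estimates should be close to p = 0.6 and r = 0.25 respectively.
print(sum(x1) / len(x1), sample_corr(x1, x2))
```

The correlation equals $r$ because two variables $X_i$, $X_j$ share the value $Z$ only when $U_i = U_j = 1$, which happens with probability $\sqrt{r} \times \sqrt{r} = r$.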

In the following subsections we will be concerned with the question of uniqueness
of the set of $X_i$ generated above. Needless to say, it is useful to know whether the
Proposition gives the only answer or just one among many. More precisely, the problem
can be stated as follows. 

In order to specify the probability distribution of one $X_i$ variable one needs $h$
parameters. In addition, the correlation matrix contains one parameter. Are
these $h_1 = h + 1$ parameters sufficient to determine the joint probability distribution
of the whole set $\{X_i, i = 1, \ldots, n\}$?

Uniqueness for two binary dependent variables 

The definition (1) contains only two parameters, namely $p$ and $r$. In this case $h_1 = 2$.

It will be seen that many systems (indeed an infinite number) of variables can be 
defined which fulfill these conditions. The only exception is the case n = 2. In this 
case the correlation matrix defines completely the probability distribution. 

For two variables $X_1, X_2$ the probability distribution function is completely defined
by two joint probabilities:

$$p_{11} = P\{X_1 = 1 \text{ and } X_2 = 1\}, \quad p_{01} = p_{10} = P\{X_1 = 1 \text{ and } X_2 = 0\}, \quad \text{then: } p_{00} = 1 - p_{11} - 2p_{01}$$

Thus, through an argument based on degrees of freedom, one sees that there is a one-to-one
correspondence between the parameters $p, r$ and the probabilities $p_{11}, p_{01}$.
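
The one-to-one correspondence can be made explicit. The sketch below maps $(p, r)$ to the joint probabilities and back, using only definition (1), which gives $E(X_1 X_2) = r\,p(1-p) + p^2$. The function names and the numerical values are illustrative assumptions.

```python
def joint_from_params(p, r):
    """Joint probabilities (p11, p01, p00) of two symmetric binary variables
    with expectation p and correlation r."""
    q = 1 - p
    p11 = r * p * q + p * p      # E(X1 X2) = cov(X1, X2) + p^2
    p01 = p - p11                # P{X1 = 1} = p11 + p10 and p10 = p01
    p00 = 1 - p11 - 2 * p01      # normalization
    return p11, p01, p00

def params_from_joint(p11, p01):
    """Inverse map: recover (p, r) from the two joint probabilities."""
    p = p11 + p01
    r = (p11 - p * p) / (p * (1 - p))
    return p, r

p11, p01, p00 = joint_from_params(0.6, 0.25)
print(p11, p01, p00)
print(params_from_joint(p11, p01))   # recovers (0.6, 0.25) up to rounding
```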

Non uniqueness for more than two binary variables

On the contrary, a system of 3 variables is defined by 3 probabilities, namely

$$p_{111},\ p_{011},\ p_{001}, \quad \text{with: } p_{000} = 1 - p_{111} - 3p_{011} - 3p_{001}$$

Thus, one of these 3 numbers can be chosen arbitrarily. There is no longer a one-to- 
one correspondence with the parameters p, r. 
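
To make the non-uniqueness concrete, the sketch below builds a one-parameter family of symmetric 3-variable distributions that all share the same $p$ and $r$, with $p_{111} = t$ taken as the free number. The function names and the values $p = 0.6$, $r = 0.25$, $t = 0.30$ and $t = 0.36$ are illustrative assumptions.

```python
def three_var_family(p, r, t):
    """Return (p111, p011, p001, p000) for a symmetric distribution of three
    binary variables with expectation p, pairwise correlation r and p111 = t."""
    m2 = r * p * (1 - p) + p * p   # E(X1 X2) = p111 + p011
    p111 = t
    p011 = m2 - t
    p001 = p - 2 * m2 + t          # from p = p111 + 2*p011 + p001
    p000 = 1 - 3 * p + 3 * m2 - t  # normalization
    probs = (p111, p011, p001, p000)
    if min(probs) < 0:
        raise ValueError("t lies outside the admissible range for these p, r")
    return probs

def pair_moments(probs):
    """Recover (p, r) from (p111, p011, p001, p000)."""
    p111, p011, p001, _ = probs
    p = p111 + 2 * p011 + p001
    m2 = p111 + p011
    return p, (m2 - p * p) / (p * (1 - p))

a = three_var_family(0.6, 0.25, 0.30)
b = three_var_family(0.6, 0.25, 0.36)
print(a, pair_moments(a))
print(b, pair_moments(b))   # different distribution, same p and r
```

For these values of $p$ and $r$ any $t$ between 0.24 and 0.42 gives a valid distribution, so $p$ and $r$ alone do not determine the joint law.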

The same property can be seen in a slightly different way. 

Clearly, $E(X_1 X_2) = 1 \times 1 \times p_{11}$; similarly $p_{01} = E[(1 - X_1) X_2] = p - E(X_1 X_2)$.

This shows that the probabilities which define the distribution are specified by $E(X_1 X_2)$,
that is to say basically by the correlation matrix. As a matter of fact, since definition (1)
gives $E(X_1 X_2) = r p q + p^2$, an easy calculation gives the expressions of the
probabilities ($q = 1 - p$):

$$p_{11} = r p q + p^2, \quad p_{01} = p_{10} = p q (1 - r), \quad p_{00} = r p q + q^2$$

On the contrary, the expression of $p_{111}$, namely $p_{111} = E(X_1 X_2 X_3)$, shows that for
3 binary variables one needs a three-point moment. The two-point moment of the 
correlation matrix is no longer sufficient. 

Non uniqueness for all non-binary variables 

The uniqueness property for the two-variable case is special to binary variables. 

If $X_i$ can take 3 values, say 0, 1, 2, its probability function is defined by two numbers,
for instance $p_0, p_1$, with $p_2 = 1 - p_0 - p_1$. Thus, $h_1 = 3$. However, for two variables
one needs 5 numbers to define the symmetric joint probability distribution, namely:

$$p_{00},\ p_{10},\ p_{20},\ p_{11},\ p_{21}, \quad \text{with: } p_{22} = 1 - p_{00} - p_{11} - 2p_{10} - 2p_{20} - 2p_{21}$$

Thus, there is no uniqueness.